feat: Implement PDF Report Streaming from Storage#95
Conversation
WalkthroughThe changes implement PDF caching for report downloads, storing generated PDFs at a deterministic path based on report_id and serving cached copies without regeneration. A cleanup helper function and related cleanup logic are removed. The reports output directory path is updated from reports_output to storage/reports. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Possibly related PRs
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/app/api/v1/report_download.py (2)
1-1: Remove unusedtempfileimport.The PR description states that dependencies on
tempfilewere removed, and indeed the code no longer uses temporary files. This import should be removed.Apply this diff:
-import tempfile import os
105-105: Remove unusedbackground_tasksparameter.The
background_tasksparameter is declared but never used in the function. Since PDFs are now cached indefinitely without cleanup, this parameter serves no purpose.Apply this diff:
async def get_report_pdf( report_id: str, - background_tasks: BackgroundTasks, db_session: AsyncSession = Depends(get_db),And remove the unused import:
-from fastapi import APIRouter, Depends, HTTPException, status, BackgroundTasks +from fastapi import APIRouter, Depends, HTTPException, status
🧹 Nitpick comments (1)
backend/app/api/v1/report_download.py (1)
135-146: No cleanup strategy for accumulated PDFs.PDFs are cached indefinitely without any cleanup mechanism. Over time, this will consume increasing amounts of disk space, especially if reports are frequently generated or if old reports are no longer needed.
Consider implementing one of the following strategies:
- LRU cache with size limits: Remove least recently accessed PDFs when storage reaches a threshold
- Time-based expiration: Delete PDFs older than a configured retention period (e.g., 30 days)
- On-demand regeneration: Add a query parameter or separate endpoint to force regeneration, allowing clients to refresh stale reports
You might also want to add monitoring/alerting for disk usage on the
REPORT_OUTPUT_DIRvolume.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
backend/app/api/v1/report_download.py(1 hunks)backend/app/core/config.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
backend/app/core/config.py
54-54: Undefined name Path
(F821)
| if pdf_filepath.exists(): | ||
| logger.info(f"Serving existing PDF report for report_id: {report_id}") | ||
| return FileResponse( | ||
| path=pdf_filepath, | ||
| media_type="application/pdf", | ||
| filename=f"report_{report_id}.pdf", | ||
| status_code=status.HTTP_200_OK | ||
| ) | ||
|
|
||
| logger.info(f"Generating new PDF report for report_id: {report_id}") | ||
| html_content = render_report_html(final_report_json) | ||
|
|
||
| pdf_path = None | ||
| try: | ||
| # Offload blocking PDF generation to a thread pool | ||
| loop = asyncio.get_event_loop() | ||
| pdf_file = await loop.run_in_executor( | ||
| None, # Use default thread pool executor | ||
| lambda: tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") | ||
| ) | ||
| pdf_path = pdf_file.name | ||
| await loop.run_in_executor( | ||
| None, | ||
| lambda: HTML(string=html_content).write_pdf(pdf_path) | ||
| lambda: HTML(string=html_content).write_pdf(pdf_filepath) | ||
| ) |
There was a problem hiding this comment.
Race condition: concurrent PDF generation for the same report.
When multiple requests arrive simultaneously for a report that doesn't have a cached PDF, all requests will pass the pdf_filepath.exists() check and attempt to generate the PDF concurrently. This can lead to:
- Corrupted PDF files due to simultaneous writes
- Wasted compute resources
- Inconsistent user experience
Consider using file-based locking or an in-memory lock (e.g., using asyncio.Lock stored in a dictionary keyed by report_id) to ensure only one request generates the PDF at a time, while others wait for the generation to complete.
Example approach using asyncio locks:
# At module level
_pdf_generation_locks: dict[str, asyncio.Lock] = {}
# In get_report_pdf function, before checking if PDF exists:
if report_id not in _pdf_generation_locks:
_pdf_generation_locks[report_id] = asyncio.Lock()
async with _pdf_generation_locks[report_id]:
# Re-check existence after acquiring lock
if pdf_filepath.exists():
logger.info(f"Serving existing PDF report for report_id: {report_id}")
return FileResponse(...)
# Generate PDF...🤖 Prompt for AI Agents
In backend/app/api/v1/report_download.py around lines 139 to 157, multiple
concurrent requests can pass the pdf_filepath.exists() check and start PDF
generation simultaneously; add a per-report asyncio lock to serialize
generation: create a module-level dict mapping report_id to asyncio.Lock,
acquire the lock before re-checking pdf_filepath.exists(), return the
FileResponse if rediscovered, otherwise generate the PDF inside the lock (using
the existing executor call), then release the lock and optionally remove the
lock from the dict to avoid growth; this ensures only one coroutine writes the
PDF while others wait and then serve the cached file.
|
|
||
| BASE_DIR: Path = Path(__file__).parent.parent.parent | ||
| REPORT_OUTPUT_DIR: Path = BASE_DIR / "reports_output" | ||
| REPORT_OUTPUT_DIR: Path = BASE_DIR / "storage" / "reports" |
There was a problem hiding this comment.
Critical: Missing import for Path.
The Path class is used on lines 53-54 but is not imported. This will cause a NameError at runtime when the Settings class is instantiated.
Add the missing import at the top of the file:
from pydantic_settings import BaseSettings, SettingsConfigDict
from typing import Dict, List
+from pathlib import Path🧰 Tools
🪛 Ruff (0.14.8)
54-54: Undefined name Path
(F821)
🤖 Prompt for AI Agents
In backend/app/core/config.py around line 54, the Settings class uses Path
(REPORT_OUTPUT_DIR) but Path is not imported, causing a NameError; add the
missing import by importing Path from the pathlib module at the top of the file
(for example: from pathlib import Path) so the REPORT_OUTPUT_DIR assignment
resolves correctly.
Overview
This PR updates the report download functionality to stream pre-generated PDF reports directly from a designated storage directory, rather than generating them on-the-fly from HTML.
Changes
PDF_REPORT_STORAGE_PATHsetting tobackend/app/core/config.pyto specify the location of stored PDF reports (e.g.,storage/reports).get_report_pdfendpoint inbackend/app/api/v1/report_download.pyto useFileResponseto serve existing PDF files.report_idand the newPDF_REPORT_STORAGE_PATHsetting.weasyprintandtempfilefor PDF generation inreport_download.pysince files are now streamed directly.Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.